Exploratory Data Analysis
Analysis 1
By <Alyssa Shou>
For my question on how crime type varies throughout the day, I started by graphing a full distribution of the number of crimes per hour. Based on this line graph, I saw that there are peaks at 12 am and 12 pm, so I used these two hours as time frames to specifically analyze.
I was also interested in analyzing rush hour time frames because there are more people commuting during those times and thus a higher potential for crime. Morning rush hour is considered 6-9 am and evening rush hour is considered 4-7 pm.
For each time frame, I have graphed the top 10 types of crimes and the top 10 locations at which crimes occur. These graphs are shown below.
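The time-frame slicing and top-10 ranking described above can be sketched roughly as follows. This is a minimal, library-free illustration rather than the code used for the report: the names `TIME_FRAMES`, `time_frame`, and `top_types` are hypothetical, the records are made up, and the hour boundaries assume each frame's end hour is exclusive.

```python
from collections import Counter

# Assumed time frames on a 24-hour clock; the end hour is exclusive.
TIME_FRAMES = {
    "Midnight": (0, 1),        # the 12 am hour
    "Morning Rush": (6, 9),    # 6-9 am
    "Noon": (12, 13),          # the 12 pm hour
    "Evening Rush": (16, 19),  # 4-7 pm
}

def time_frame(hour):
    """Return the name of the time frame containing `hour`, or None."""
    for name, (start, end) in TIME_FRAMES.items():
        if start <= hour < end:
            return name
    return None

def top_types(records, frame, n=10):
    """Most common crime types among (hour, crime_type) records in `frame`."""
    counts = Counter(crime for hour, crime in records
                     if time_frame(hour) == frame)
    return counts.most_common(n)

# Toy (hour, crime type) records, made up for illustration only.
records = [(0, "THEFT"), (0, "BATTERY"), (0, "THEFT"),
           (7, "BURGLARY"), (8, "THEFT"), (12, "THEFT"), (17, "BATTERY")]
print(top_types(records, "Midnight", n=2))
```

The same ranking could be applied to location descriptions instead of crime types by passing (hour, location) pairs.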
After finding the top 10 types of crime at each time, we can observe that the top crimes are very similar throughout the day. On that front, I did not see much variation, but there are some minor differences.
- Midnight is the only time where retail theft is not in the top 10
- Forced burglary is a top 10 crime at morning rush
- Theft from building is a top 10 crime during evening rush hour
- Vehicle-related crimes (i.e., damage and theft) are less common at noon than at other times
Finding the top 10 crimes at each time was the first thing I tried since it seemed like the most logical first step in the analysis. I did not anticipate or run into any problems at this step. Even the differences above are not that significant, and 9 out of 10 crimes overlap between time frames. Each of these differences has a plausible common-sense explanation. For example, forced burglary may be more common during morning rush hour because burglars are aware that houses will be empty as people leave to commute to work.
Public vs. Private Crime Rates
After looking at the top 10 types of crimes and observing no egregious variation, my next step was to analyze crime locations. The top three locations are the same for every time frame: Street, Apartment, Residence. After those three locations, the rest of the locations are clearly less common. I then took the total number of crimes committed among the top 10 crimes and found the proportion of them that were committed in public areas such as street, sidewalk, public garage, etc.
- Morning Rush: 49.6%
- Noon: 45.9%
- Evening Rush: 62.9%
- Midnight: 46.4%
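As a rough sketch of how a public-crime share like the ones above could be computed: the set `PUBLIC_LOCATIONS` below is an assumption (the report names street, sidewalk, and public garage but does not give the full list), and `public_share` is a hypothetical helper, not the code behind the reported figures.

```python
# Assumed set of "public" location labels; the report's full list is not given.
PUBLIC_LOCATIONS = {"STREET", "SIDEWALK", "PUBLIC GARAGE"}

def public_share(locations, public=PUBLIC_LOCATIONS):
    """Fraction of crimes whose location label is in the `public` set."""
    if not locations:
        return 0.0
    return sum(loc in public for loc in locations) / len(locations)

# Made-up location labels for one time frame, for illustration only.
sample = ["STREET", "APARTMENT", "RESIDENCE", "SIDEWALK"]
print(f"{public_share(sample):.1%}")
```

Running the function once per time frame gives one percentage per frame, which is the shape of the list above.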
Crimes are most public during evening rush hour, while the share of public crime is about the same in the other three time frames. I believe exploring crime location is important since these are general locations that stakeholders should avoid during specific times. The public crime rate is a useful way to measure safety because the most common crimes tend to be public ones and are often impulsive actions such as assault/battery and theft.
At first, I tried to keep all the time frames in a side-by-side barplot to keep the visualization cleaner and more brief. I quickly realized that this would not work because the top 10 locations for crimes at different times throughout the day would not be the same. Therefore, there would be parts of the graph with empty or very low bars for locations that are not included in the top 10 locations for all times. This would be misleading to stakeholders so I chose to slice the data and create separate bar graphs for clarity and organization.
Rate of Violent Crimes
Stakeholders are often concerned about violent crime. A common stereotype of Chicago is that we are a very dangerous city, and many people unconsciously think of violent crimes when they think of crime in general. Because of this, I created a new column in the dataset that labels each observation as violent or non-violent according to its Illinois Uniform Crime Report (IUCR) code (see code file). Violent crimes include armed robbery, aggravated domestic battery, and homicide (first-degree murder). Non-violent crimes include theft (pocket-picking), forgery, unlawful entry/trespassing, etc. I then found the percentage of reported crime in each time frame that was labeled violent. These percentages are listed below.
- Morning Rush: 23.69%
- Noon: 18.88%
- Evening Rush: 23.93%
- Midnight: 21.88%
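The labeling and per-frame percentage can be illustrated with a small sketch. The report labels crimes by IUCR code in its code file, which is not reproduced here, so the set `VIOLENT_DESCRIPTIONS` below is a stand-in assumption built from the examples named in the text, and `violent_rate_by_frame` is a hypothetical helper.

```python
from collections import Counter

# Stand-in for the IUCR-based labeling; these descriptions are assumptions.
VIOLENT_DESCRIPTIONS = {
    "ARMED ROBBERY",
    "AGGRAVATED DOMESTIC BATTERY",
    "FIRST DEGREE MURDER",
}

def violent_rate_by_frame(rows):
    """rows: (time_frame, description) pairs -> {time_frame: violent share}."""
    totals, violent = Counter(), Counter()
    for frame, description in rows:
        totals[frame] += 1
        if description in VIOLENT_DESCRIPTIONS:
            violent[frame] += 1
    return {frame: violent[frame] / totals[frame] for frame in totals}

# Made-up rows for illustration only.
rows = [("Noon", "ARMED ROBBERY"), ("Noon", "FORGERY"),
        ("Noon", "POCKET-PICKING"), ("Noon", "FORGERY"),
        ("Midnight", "FIRST DEGREE MURDER"), ("Midnight", "FORGERY")]
print(violent_rate_by_frame(rows))
```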
Unfortunately, these rates are not negligible: roughly one in five reported crimes is violent at every time of day. In the recommendations, I will provide action items for stakeholders based on this analysis.
Note: the previous year’s project group also analyzed crime at various hours of day. See Appendix A2 for more information.
Analysis 2
By <Grace Chang>
Since I am conducting research on theft, I subset the data so that only data with the Primary Type “Theft” remained; therefore, I could perform my subsequent analyses on only the theft data. I firstly wanted to see how the types of theft varied across the seventy-seven government-delineated Chicago community areas. In order to perform this analysis, I looked for the top twelve community areas with the highest number of observations of theft crimes, and subset the data such that only the data for these twelve areas remained. I focused on only these twelve observations because I wanted to visualize for stakeholders which areas they should pay particular attention to in terms of how often theft crimes are observed there, and how types of theft crimes compare across those twelve areas.
In order to visualize these statistics, I decided on using a stacked bar-plot. Originally, I tried to use a series of line plots via the Seaborn FacetGrid method, with one plot for each community area in the top twelve showing the number of thefts of a given type during each month of the year. This did not work because there was a drastic difference between the number of occurrences in certain categories of theft for some community areas during most of the months (likely due to an insufficient number of observations in several theft type-month groups). As a result, the scale of the plots within the FacetGrid, while consistent, did not match up well with the scale of the lines plotted, and the visualization was difficult to view and interpret. The stacked bar-plot removed the month factor, but I realized by attempting the FacetGrid that the month did not matter much. The stacked bar-plot makes it easy to see the proportions of the various theft types and to compare the frequencies visually across the different community areas.
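The table behind a stacked bar-plot of this kind is a cross-tabulation of theft type by community area, restricted to the areas with the most thefts. A minimal sketch, assuming rows of (community area, theft type) pairs, is shown here. `theft_mix` is a hypothetical helper and the data is made up.

```python
from collections import Counter, defaultdict

def theft_mix(rows, top_n=12):
    """rows: (community_area, theft_type) pairs.

    Returns {area: {theft_type: share}} for the `top_n` areas with the
    most thefts; each area's shares sum to 1.
    """
    area_counts = Counter(area for area, _ in rows)
    top_areas = [area for area, _ in area_counts.most_common(top_n)]
    table = defaultdict(Counter)
    for area, kind in rows:
        if area in top_areas:
            table[area][kind] += 1
    return {area: {kind: count / area_counts[area]
                   for kind, count in table[area].items()}
            for area in top_areas}

# Made-up rows for illustration only.
rows = [("Area A", "OVER $500"), ("Area A", "OVER $500"),
        ("Area A", "RETAIL THEFT"), ("Area A", "$500 AND UNDER"),
        ("Area B", "RETAIL THEFT"), ("Area B", "OVER $500"),
        ("Area C", "POCKET-PICKING")]
print(theft_mix(rows, top_n=2))
```

Each inner dictionary corresponds to one stacked bar, with the shares giving the heights of its segments.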
Based on this plot, I noticed that theft over 500 dollars outnumbers theft under 500 dollars in most of the community areas, with a notable difference in the ‘West Town’ and ‘Near West Side’ areas, showing that theft of greater financial value is more common than theft of lesser value. This is further supported by each type’s share of overall theft: theft over 500 dollars accounts for 33%, and theft under 500 dollars for around 28%. While this is not a large difference, I concluded that financial theft in general is by far the most common type of theft. The plot also exhibits this trend: theft of monetary assets accounts for the greatest proportion of thefts, and retail theft is also common. One surprising observation from this plot is that pick-pocketing makes up only a small part of thefts across the twelve community areas compared to retail theft and financial theft.
Finally, I observed that the four areas with the most theft occurrences were concentrated in the same area—The Loop, Near North Side, Near West Side, and West Town border each other, and this region is also often described as downtown Chicago by visitors and residents. This observation implies that people should be particularly wary in these areas, especially seeing as they are popular areas to live and visit. Based on this context, I further questioned whether there was a relationship between the population density of a community area and the number of thefts in that area. In order to attack this problem, I created a dataset via merging that included the community area names, their corresponding numbers of thefts, and their population densities. I then utilized this dataset to plot a linear regression relating population density and number of thefts.
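The merge-then-regress step can be sketched as below. This is an illustrative least-squares fit written without libraries, not the report's own code: `fit_line` is a hypothetical helper, the area names and numbers are made up, and the merge is assumed to be an inner join on community area name.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Made-up inputs for illustration only.
density = {"Area A": 1000.0, "Area B": 2000.0, "Area C": 3000.0}
thefts = {"Area A": 150, "Area B": 250, "Area C": 350, "Area D": 90}

# Inner join on area name: keep only areas present in both tables.
merged = [(density[area], thefts[area]) for area in density if area in thefts]
slope, intercept = fit_line([d for d, _ in merged], [t for _, t in merged])
print(slope, intercept)
```

A positive slope corresponds to the general positive relationship between density and theft count discussed in this analysis.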
Community Areas Ranked by Population Density
| Row index | Community Name | Density (per sq km) |
| --- | --- | --- |
| 0 | Near North Side | 14863.58 |
| 4 | Lake View | 12752.44 |
| 15 | Edgewater | 12491.89 |
| 9 | Rogers Park | 11672.81 |
| 1 | The Loop | 9897.73 |
| 38 | Albany Park | 9732.13 |
| 10 | Uptown | 9516.37 |
| 6 | Lincoln Park | 8612.96 |
| 12 | West Ridge | 8435.36 |
| 55 | Hermosa | 7940.46 |
| 20 | Belmont Cragin | 7713.71 |
| 7 | Logan Square | 7707.48 |
It is important to see that while there is a general positive correlation, only two of the four community areas with the highest numbers of theft fall among the twelve most population-dense areas. I compared the top four with the top twelve because the top four community areas of interest were originally extracted from the twelve community areas with the most theft overall. This observation implies that there are likely many other variables affecting the number of thefts aside from the population density of an area, but based on the trend, population density still appears to be a relevant independent variable. The Near North Side (rank 1 in population density) and The Loop (rank 5 in population density) are much denser than West Town and the Near West Side and border them on the east, so a potential explanation is that many people travel between these areas, and with them comes a spill-over of theft crimes into the two less dense areas.
Analysis 3
By <Grace Shao>
I began by investigating the location of CTA crimes. In order for Chicagoans to know which locations they should avoid, the community area with the most crime is important information. I found the 10 community areas with the highest number of crimes, and graphed them below in descending order. The Loop has the most crime by far, outnumbering other community areas by a significant amount. Compared to the #2 most dangerous community area, The Loop still had more than 4x the amount of crime.
This graph matches the map of CTA crimes as shown below. I wanted to create this map to visualize where crimes were happening and highlight that The Loop had a very high number of crimes, shown with the high density of points in that area. Since each community area has a different color, it also helps visualize how crimes are spread out across different areas. With this question, I did not anticipate that I would have to make many changes to make the map more readable. I changed the color scale to correspond with the community area, decreased opacity, and increased the zoom to focus on The Loop. This process took a lot of trial and error, especially since plotly was a new library that I had never used before.
Since The Loop represented such a large portion of the crimes committed and is an extremely popular area of Chicago (The Bean, Art Institute, and River Walk are all located there), I wanted to do further analysis on it. Subsetting the data to include only The Loop, I found the most common crimes occurring on CTA stations within that community area. For the top 6 crime types, I found that 3 were theft related and 3 were more physical and violent. Simple battery was the most common crime overall, while pickpocketing was the second most common crime.
I thought it was interesting that theft-related crimes were much more likely to happen on the train. Physical crimes, by comparison, happened on the platform in a noticeably higher proportion than theft did.
Now that I had established The Loop as the most dangerous community area, I wanted to also pinpoint the most dangerous stations. During my data analysis, I ran into a problem. The dataset did not include which station the crime was committed at – only the longitude and latitude. I had anticipated that the crimes might be clustered and easy to identify on a map, but I quickly found that in areas with many stations it was difficult to visually identify which station the crime belonged to, and would be a very slow process to do manually. I decided instead to import a list of the CTA stations and their coordinates.
For each crime, I found the closest station by longitude and latitude using the Haversine formula, which is recommended for coordinate calculations because it gives the great-circle distance between two points on a sphere\(^{1}\). I then found and graphed the top 3 most dangerous stations. While there are a few outliers (one CTA crime shows up in a different state), I left them in the map to accurately portray all of the crimes reported. I chose to calculate the nearest station because it assigns each crime to a station automatically and efficiently.
- “Haversine Formula to Find Distance between Two Points on a Sphere.” GeeksforGeeks, GeeksforGeeks, 5 Sept. 2022, www.geeksforgeeks.org/haversine-formula-to-find-distance-between-two-points-on-a-sphere/.
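A minimal sketch of nearest-station matching with the Haversine formula follows. The station names and coordinates are placeholders made up for illustration, the Earth radius of 6371 km is the commonly used mean value, and `haversine_km` and `nearest_station` are hypothetical helper names rather than the report's own functions.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, treating the Earth as a sphere

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def nearest_station(lat, lon, stations):
    """Name of the station in `stations` closest to (lat, lon)."""
    return min(stations,
               key=lambda name: haversine_km(lat, lon, *stations[name]))

# Placeholder stations with made-up coordinates, for illustration only.
stations = {"Station A": (41.87, -87.63), "Station B": (41.72, -87.62)}
print(nearest_station(41.868, -87.627, stations))
```

Applying `nearest_station` to every crime gives a station label per crime, which can then be counted to rank stations.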
The map below shows the mapped-out crimes for each of the top 3 most dangerous stations. Two of these stations are in The Loop, and the other is farther south. All three are Red Line stations.
While knowing the top 3 most dangerous stations is important, I also wanted to know the time of day when most crimes occur. This gives people more information in case they are traveling through these stations. To find this information and graph it, I singled out the hour for each crime that happened in the top 3 stations and created a kernel density estimate (KDE) plot to easily identify the peaks in crime. I found that crimes happen most often shortly after midnight (around 2 AM) and during rush hour. This trend held true for all 3 stations.
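To make the idea of a density plot over crime hours concrete, here is a small hand-rolled Gaussian kernel density estimate. It is only a sketch: the bandwidth of 1 hour is an arbitrary assumption, the hours are made up, and the estimate treats hours as points on a line, so it does not wrap around from 23 back to 0.

```python
from math import exp, pi, sqrt

def gaussian_kde(samples, bandwidth=1.0):
    """Return a function estimating the density of `samples` using
    Gaussian kernels of width `bandwidth`."""
    norm = 1.0 / (len(samples) * bandwidth * sqrt(2 * pi))

    def density(x):
        return norm * sum(exp(-0.5 * ((x - s) / bandwidth) ** 2)
                          for s in samples)

    return density

# Made-up crime hours (0-23), for illustration only.
hours = [2, 2, 2, 17, 17]
density = gaussian_kde(hours, bandwidth=1.0)

# Evaluate on the whole hours and report where the estimate is highest.
peak_hour = max(range(24), key=density)
print(peak_hour)
```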
Lastly, I wanted to explore just how much more dangerous Roosevelt, the most crime-ridden station, was than the average station. To find this number, I compared the number of crimes at Roosevelt station to the average number of crimes per station. This would be important for my recommendations to stakeholders to illustrate why safety is important around The Loop.
The Roosevelt station has about 7.32 times as much crime as the average station.
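The comparison is a simple ratio of one station's count to the mean count over stations. The sketch below uses made-up counts and a hypothetical helper `ratio_to_average`. It assumes the average is taken over all stations, including the one being compared, which the report does not specify.

```python
def ratio_to_average(counts, station):
    """How many times the average per-station count `station` has."""
    average = sum(counts.values()) / len(counts)
    return counts[station] / average

# Made-up crime counts per station, for illustration only.
counts = {"Station X": 30, "Station Y": 10, "Station Z": 5}
print(round(ratio_to_average(counts, "Station X"), 2))
```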
Analysis 4
By <Paisley Lucier>
For my analysis, I explored the associations between the proportion of committed crimes that resulted in arrest and the crime location. Since arrest proportion is a critical metric for policing, I used police district to describe location to ensure applicability for stakeholders. In particular, I also wanted to consider a police district’s sentiment score rating to see if there were associations between a district’s sentiment score and its arrest proportion. As ‘Arrest’ is a boolean value, I had to find representative and effective ways to bin the data, before landing on binning police districts by side of Chicago. Ultimately, within my analysis I look at general trends in location by considering the ‘side’ of districts, while still maintaining the specificity of district for other analyses in order to offer recommendations for police in specific districts.
Firstly, I generally looked at the proportion of observations that resulted in arrest by each side of Chicago, as well as the average sentiment score by side.
The bar graph on the left shows the proportion of crimes resulting in arrest for each side. The North side has the highest overall arrest proportion, higher than both the South and Central sides, which have very similar proportions. On the right, we can see that the North side also has the highest average police sentiment score. However, the Central side is very close in average score, and the South side’s average score is notably below the other two.
Considering differences in arrest rate for different crime types, I next looked at whether the proportion of people arrested for the same primary type of crime differs across sides.
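Because ‘Arrest’ is a boolean, the arrest proportion for a group is the share of its rows where the value is true. A minimal sketch of that grouping follows. The district-to-side mapping `SIDE_OF_DISTRICT` is purely illustrative (the report's actual binning is not shown), and the rows are made up.

```python
from collections import defaultdict

def arrest_proportion(rows):
    """rows: (group, arrested) pairs -> {group: share of rows arrested}."""
    totals = defaultdict(int)
    arrests = defaultdict(int)
    for group, arrested in rows:
        totals[group] += 1
        if arrested:
            arrests[group] += 1
    return {group: arrests[group] / totals[group] for group in totals}

# Illustrative mapping only; not the binning used in the report.
SIDE_OF_DISTRICT = {101: "North", 102: "South", 103: "Central"}

# Made-up (district, arrested) rows, for illustration only.
rows = [(101, True), (101, False), (102, False), (102, False),
        (103, True), (103, False), (103, False), (103, False)]

by_side = arrest_proportion((SIDE_OF_DISTRICT[d], a) for d, a in rows)
print(by_side)
```

Grouping on a (side, primary type) tuple instead of the side alone would give the per-crime breakdown across sides.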
The barplot above shows the proportion of observations arrested for each of the 10 crimes with the most overall observances in the data, separated by the side the crime was committed in. Within this bar chart, we can see that the disparities in arrest proportion across sides prevail, though smaller in magnitude. This graph shows us that of the top 10 most frequently recorded crimes, the North Side’s arrest proportion is higher for 8 of them. Additionally, we can see that some crimes have much higher arrest proportions across all sides than others: narcotics and weapons violations have higher arrest proportions than other crimes.
I next considered the association between arrest proportion and average sentiment score for each district for the top 10 crimes. The 5 crimes of Robbery (0.658), Battery (0.598), Theft (0.495), Assault (0.459), and Burglary (0.442) had the highest correlation between a district’s average sentiment score rating and its arrest rate (with the next highest having a correlation coefficient of 0.26). Considering comparable qualities, including arrest rate and trendline (see appendix A.1), I binned these five crimes with the highest correlations by general type (physical assault and theft) and visualized them below.
The visualization above portrays that for the crime types that are assault and battery, as well as the crime types of burglary, robbery, and theft, there is a moderate, positive, and linear association between the proportion arrested in a district and average sentiment for a district. This means that in general, districts with higher sentiment scores tend to have higher arrest proportions for the types of crimes included above.
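The strength of a linear association like this is commonly summarized with the Pearson correlation coefficient. The sketch below is a plain implementation for illustration, assuming that is the coefficient behind the reported values. The district-level numbers are made up and `pearson_r` is a hypothetical helper name.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)

# Made-up per-district values, for illustration only.
sentiment = [55.0, 60.0, 65.0, 70.0]
arrest_prop = [0.10, 0.12, 0.11, 0.15]
print(round(pearson_r(sentiment, arrest_prop), 3))
```

Values near 1 indicate a strong positive linear association, values near 0 little linear association, and values near -1 a strong negative one.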
Lastly, I wanted to look at which crime-district combinations had the highest disparity in arrest proportion. Below is a dataframe displaying the top 10 most observed crime types and their districts with the highest/lowest arrest proportion, as well as the difference between the highest and lowest proportion.
| Primary Type | District with highest arrest proportion | District with lowest arrest proportion | Difference in arrest proportion |
| --- | --- | --- | --- |
| Weapons Violation | 18 | 17 | 0.516396 |
| Robbery | 16 | 5 | 0.134063 |
| Narcotics | 19 | 6 | 0.116667 |
| Assault | 1 | 12 | 0.085086 |
| Burglary | 18 | 4 | 0.083151 |
| Battery | 1 | 7 | 0.078671 |
| Theft | 1 | 7 | 0.072987 |
| Criminal Damage | 16 | 2 | 0.036573 |
| Deceptive Practice | 22 | 25 | 0.028950 |
| Motor Vehicle Theft | 20 | 9 | 0.028729 |
From the dataframe above (sorted by the difference in arrest proportion), we can see that, of the top 10 crimes, weapons violation has the highest arrest disparity, followed by robbery and narcotics. These top 3 all have a difference of over 10 percentage points. Additionally, 5 out of 10 of the districts with the minimum arrest proportions are on the South side of Chicago.
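A table of this shape can be derived from per-district arrest proportions by taking, for each crime type, the district with the largest value, the district with the smallest, and the gap between them. The sketch below uses made-up proportions, and `arrest_disparity` is a hypothetical helper, not the report's code.

```python
def arrest_disparity(props):
    """props: {crime: {district: arrest proportion}}.

    Returns (crime, highest district, lowest district, gap) rows,
    sorted by gap in descending order.
    """
    rows = []
    for crime, by_district in props.items():
        highest = max(by_district, key=by_district.get)
        lowest = min(by_district, key=by_district.get)
        gap = by_district[highest] - by_district[lowest]
        rows.append((crime, highest, lowest, gap))
    return sorted(rows, key=lambda row: row[3], reverse=True)

# Made-up proportions, for illustration only.
props = {
    "CRIME A": {1: 0.12, 2: 0.05, 3: 0.08},
    "CRIME B": {1: 0.20, 2: 0.30, 3: 0.05},
}
for row in arrest_disparity(props):
    print(row)
```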
Conclusions & Recommendations
Our individual analyses address the broader topic of how to promote personal and community safety and welfare within Chicago. This plays into people’s satisfaction with policing and how to improve these sentiments, along with suggestions on how people should look out for themselves when traveling or living in the city. When examining the various trends yielded by our analyses, it is clear that across theft, general crime, and CTA crime, rush hour and midnight are the most dangerous times. Additionally, theft is very common across Chicago, whether on the street, in residential homes, or in transportation areas, so stakeholders should be vigilant about their possessions. They can feel less anxious about murder, for example, which makes up only 0.3% of total crimes.
Alyssa’s recommendation
Looking overall at the types of crimes committed during the day, the main takeaway for visitors and residents of Chicago is that crime during the day is just as rampant as crime at night. As for the crimes being committed, our analysis shows that the top 10 most common crimes are consistent from morning rush hour until midnight. In addition, the top 3 locations are identical for all time frames. Be especially careful during evening rush hour because that is when the share of public crime is highest. Note that crimes are just as public in the morning and at noon as they are at midnight, so do not assume that you are safer in broad daylight surrounded by many people. In addition, the analysis shows that crime is about as likely to be violent at any hour of the day. Some people like to carry personal safety equipment like pepper spray when going out at night. This equipment is just as crucial during daytime hours as it is after dark, so I advise stakeholders to take that into consideration.
Stakeholders should keep in mind that I did not analyze every hour of the day. I am using the 8 hours that I did analyze to generalize recommendations that I believe will hold up for the other 16 hours that are not analyzed. The data we analyzed is fairly recent so stakeholders do not need to repeat my analysis. However, based on their occupation, most common commuting routes, and/or most-frequented locations, stakeholders should do extra research for their personal safety in those areas if it may differ from the analysis I’ve presented. Please also take a look at Grace Shao’s CTA analysis if you commute via L train often.
Grace Chang’s recommendation
Next, based on the analysis of theft crimes, stakeholders (anyone who frequents or resides in Chicago) should pay more attention to their personal belongings in the region consisting of The Loop, Near North Side, Near West Side, and West Town. This region is popular for travel, as it includes financial districts and tourist attractions such as the Magnificent Mile, the Bean, and more, so many stakeholders are affected by this result. Furthermore, seeing that 33% of all thefts are thefts of financial assets over 500 dollars, and 28% are thefts of under 500 dollars, it is essential to be attentive to one’s financial possessions. Meanwhile, pick-pocketing represents only a small percentage (5.16%) of total theft crimes, so stakeholders can be assured that this crime is less common, contrary to the common assumption that pick-pocketing is a heavy concern when it comes to theft.
There are a few limitations that stakeholders should keep in mind: This analysis does not include motor vehicle theft, another common type of theft, because motor vehicle theft has its own subsets of theft types that clash with the general theft category or overwhelm it, such that it became difficult to perform deeper analysis on the general theft category. Additionally, within these community areas there are neighborhoods that can vary in crime rates, but these go beyond the scope of our research and dataset, so stakeholders should do further analysis on the specific neighborhood(s) they are visiting.
Grace Shao’s recommendation
On the CTA, it is clear that The Loop has the highest amount of crime by far, with more than 4x as much crime as the next most dangerous community area. Therefore, in The Loop especially, it is important to stay alert. As for what types of crimes to look out for in this area, simple battery and pickpocketing are the most likely. Theft is much more likely to occur on the train than on the platform, so it is important to watch your belongings closely and keep valuables out of sight on the train. Compared to theft, physical or violent crimes have a higher chance of happening on the platform. Therefore, avoid making contact with others on the platform and leave space between you and others. Since The Loop is a popular tourist area, with landmarks such as the River Walk, Art Institute, and Cloud Gate, many stakeholders may be traveling there.
As for specific stations, avoid Roosevelt, 95th/Dan Ryan, and Jackson when possible, especially around midnight and 6-7 pm, when the crime rate peaks. To put it in perspective, Roosevelt, the station with the most crime, has about 7.32x as much crime as the average station. By following these recommendations, stakeholders can stay safe while traveling in the city.
Paisley’s recommendation
For the police stakeholders, police should allocate resources, including more research into demographic information and district needs, to pinpoint the roots of the disparities in arrest rates across districts for the same type of crime: namely weapons violations in districts 17 and 18, robbery in districts 5 and 16, and narcotics in districts 6 and 19, which all have arrest proportion disparities of more than 10 percentage points across the named districts.
Particularly, as seen in the associations between a district’s arrest proportion and its sentiment rating, robbery has the highest correlation between a district’s robbery arrest proportion and the district’s police sentiment score, and also is in the top 3 crimes (of the top 10) with the highest arrest disparity. Thus, police should allocate resources to prevention of robbery in district 5, as well as further consider their arrest tactics and get community input to aim for higher sentiment scores. (Note: District 5’s lowest robbery arrest proportion is followed by districts 3, 6, and 12, so this recommendation extends to these districts).